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ABSTRACT 

Analytic fiinctions represent the state-of-the-art way of perform- 
ing complex data analysis within a single SQL statement. In par- 
ticular, an important class of analytic functions that has been fre- 
quently used in commercial systems to support OLAP and decision 
support applications is the class of window functions. A window 
function returns for each input tuple a value derived from applying 
a function over a window of neighboring tuples. However, exist- 
ing window function evaluation approaches are based on a naive 
sorting scheme. In this paper, we study the problem of optimiz- 
ing the evaluation of window functions. We propose several effi- 
cient techniques, and identify optimization opportunities that allow 
us to optimize the evaluation of a set of window functions. We 
have integrated our scheme into PostgreSQL. Our comprehensive 
experimental study on the TPC-DS datasets as well as synthetic 
datasets and queries demonstrate significant speedup over existing 
approaches. 

1. INTRODUCTION 

Today's mainstream commercial database systems such as DB2, 
Oracle and SQL Server support analytic functions in SQL to express 
complex analytical tasks. With these analytic functions, common 
analyses such as ranking, percentiles, moving averages and cumu- 
lative sums can be expressed concisely in a single SQL statemen- 
t. More importantly, these functions lead to more efficient query 
processing - analytic queries expressed with analytic functions can 
potentially eliminate self-joins, correlated subqueries and/or use 
fewer temporary tables compared to the counterparts without such 
functions [19, 4]. However, to our knowledge, there were not many 
reported works on optimizing the processing of analytic functions. 

In this paper, we focus on an important class of analytic func- 
tions, called window functions, that was introduced by the SQL:2003 
standard and has been widely used to support OLAP and decision 
support applications. A window function is one of the ranking, 
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reference, distribution and aggregate functions. However, it is e- 
valuated over a (base or derived) table, which can be viewed as the 
union of logical window partitions. Each window partition has a 
single value of WPK, which is a set of attributes {wpki, wpki, • • • , 
wpk m }; each pair of window partitions are disjoint on values of 
WPK. With an empty WPK, the whole table forms a single window 
partition. In addition, tuples of each window partition are ordered 
by WOK, which is a sequence of attributes (woki, uiofe, • • • , wok n ). 
Within a window partition, each tuple has a window of neighboring 
tuples meeting certain criteria. The window function essentially 
calculates and appends a new window-function attribute to each tu- 
ple t, by applying a function over all tuples of the window of t. In 
other words, the evaluation of a window function wf over a table T 
results in a new table T', which contains exactly every tuple of T 
but also has a new attribute whose values are derived by wf . 

A basic window query block can be viewed as a normal SQL 
query Q plus one or more window functions defined in the SELECT 
clause. These window functions are independent of each other. To 
evaluate the window query, all query clauses in Q except ORDER 
BY and DISTINCT are first optimized and executed to derive a 
windowed table, on which these window functions are then in- 
voked. Finally, the ORDER BY clause is imposed to sort the re- 
sultant table containing new window-function attributes to some 
specific order. 

Example 1 To find the rankings of each employee's salary within 
his department as well as the whole company, the corresponding 
window query defines two window functions rank_in_dept and 
globalrank, where the PARTITION BY key represents WPK 
and the ORDER BY key represents WOK: 
SELECT empnum, dept , salary, 

rank ( ) OVER ( PARTITION BY dept ORDER BY 
salary desc nulls last) as rank_in_dept , 
rank ( ) OVER (ORDER BY salary desc nulls 
last) as globalrank 
FROM emptab; 

In this example, the windowed table is the source table emptab. 
The sample output is shown below. □ 
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In current database systems, in principle the window functions 
are evaluated over the windowed table as follows. 

Single window function. Computing a single window function 
over the windowed table involves two logical steps. In the first step, 
the windowed table is reordered by a tuple reordering operation to 
physical window partitions according to the specifications of WPK 
and WOK. The conventional tuple reordering operation is a sort op- 
eration with a sort order given by the concatenation of some WPK's 
permutation and WOK. We shall refer to this reordering operation as 
Full Sort (FS). The generated window partitions are pipelined into 
the second step, where the window function is sequentially invoked 
for each tuple within each window partition. The output table has 
a tuple ordering consistent with the sort order of FS. 

Multiple window functions. A window function chain is formed 
to sequentially evaluate multiple window functions over the win- 
dowed table. The windowed table tuples are fed into the leading 
window function of the chain. For each of the remaining window 
functions, it is evaluated over the reordered output of its preceding 
window function. Figure 1 shows the window function chain for 
Example 1. 
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Figure 1: A conventional window function chain for Example 1 

We note that it is possible for a certain sort order O to satis- 
fy the specifications of several consecutive window functions in 
the chain on their window partitions. Since a window function 
does not change the tuple ordering, to evaluate a sub-chain con- 
sisting of these window functions, it is sufficient to reorder on- 
ly the input of the leading window function in the sub-chain to 
O using FS. For example, consider a window function wfi = 
(WPKi = {a,fe},W0Ki = (c)) followed by another window func- 
tion wf 2 = (WPK 2 = {6},W0K 2 = (a)). If the input of wfi is 
reordered by FS with sort order (6, a, c), then wf 2 can be directly 
evaluated over the output of wfi without further tuple reordering. 
This FS sharing optimization has been adopted by some systems, 
e.g. Oracle [5]. 

In this paper, we re-examine the problem of efficient evaluation 
of a window function, and optimization of a chain of window func- 
tions. For a single window function evaluation, we develop two 
new tuple reordering mechanisms. The first method, called Hashed 
Sort (HS) is based on the key observation that window partition- 
s delivered by a tuple reordering operation can be in an arbitrary 
order without affecting the result correctness and the performance 
of the subsequent window function invocation. Thus, under a HS 
operation, a table is first hashed into buckets made up of complete 
window partitions, and then each bucket is separately sorted to de- 
rive the physical window partitions. HS is expected to be superi- 
or over FS especially when the sorting memory is small, since it 
avoids a full sort. 

The second method, called Segmented Sort (SS), can be ap- 
plied when it is possible to take advantage of the ordering of the 
input table. Specifically, the input of a window function should 
correspond to a sequence of tuple segments such that sorting each 
segment results in the desired window partitions for evaluating the 
window function. As an example, consider a window function 
wfi = (WPKi = {a},W0Ki = (&)) followed by another window 
function wf 2 = (WPK 2 = {a},W0K 2 = (c)) in the chain. If the 
input of wfi is reordered by FS with sort order (a, b) or by HS with 
hash key {a} as well as sort order (a, b), then the output of wfi 



consists of tuple segments each having a single a value. As such, 
it is sufficient to do the tuple reordering for wf 2 by simply sorting 
each segment on (c). Compared with FS and HS, SS is expected 
to incur a much lower cost since it does not need a full sort or table 
partitioning. 

FS and HS have the duality analogous to that of other hash-based 
and sort-based query processing methods [12]. SS looks similar to 
the partial sort operation [7, 8, 13], which produces the required 
complete sort order by exploiting the partial sort order satisfied by 
the input. However, unlike the partial sort whose input and out- 
put both must be totally ordered, SS is more flexible as its input 
and output could be a segmented relation, which be viewed as a 
sequence of relation segments that is partitioned based on some at- 
tributes and sorted on some other attributes. In fact, SS can be 
considered as a generic and flexible extension of the partial sort op- 
eration to reorder tuples for evaluating window functions. To the 
best of our knowledge, we are the first to recognize and study the 
benefits of applying the HS and SS techniques in the context of 
window function evaluation. 

For multiple window functions in a query, finding an optimal se- 
quence of evaluating the window functions turns out to be NP-hard. 
As such, we also propose a cover set-based optimization scheme to 
efficiently generate a window function chain. Our scheme essen- 
tially groups the set of window functions into cover sets such that 
window functions within a cover set incur at most one FS/HS/SS 
reordering operation (for the leading window function in the cover 
set). We present our heuristics to partition the window functions in- 
to cover sets and to order the cover sets for processing. Our scheme 
naturally subsumes the FS sharing optimization. 

Our techniques of window function evaluation can be seamless- 
ly integrated in a typical query optimizer and work in conjunction 
with other complementary optimization methods (e.g., interesting 
orders [16] and parallel execution). We have built a prototype of 
our techniques within PostgreSQL [1], and conducted an extensive 
performance study with the TPC-DS [2] datasets as well as syn- 
thetic datasets and queries. The results showed the effectiveness of 
HS and SS operations over FS, and the near-optimality of our cover 
set-based optimization scheme over existing approaches. 

The rest of this paper is organized as follows. In Section 2, we 
present some notations which will be utilized throughout the paper. 
In Section 3, we elaborate the details of the HS and SS operations. 
In Section 4, we describe our cover set-based optimization scheme 
for the evaluation of multiple window functions over a windowed 
table. In Section 5, we discuss how to incorporate our techniques 
of window function evaluation in an integrated query optimization 
framework. Section 6 validates the effectiveness of our proposed 
techniques. We discuss the related work in Section 7 and finally 
conclude in Section 8. Proofs of technical results are given else- 
where [9] . 

2. PRELIMINARIES 

Let A be a set of attributes; let X and Y be two sequences of at- 
tributes. Besides the standard notations for sets, such as Cardinality 
(||), Subset (C, c), Union (U), Interesection (n) and Complement 
(— ), we also utilize the following notations: 

• A : a permutation of A; 

• |X| : the number of attributes in X; 

• ottr(X): the set of attributes in X; 

• X o Y: the sequence of attributes obtained by concatenating X 
and Y; 

• X A Y: the longest common prefix between X and Y; 

• X(<) < Y: X is a (proper) prefix of Y; 

• e: an empty attribute sequence. 



1245 



Given a set of attributes A and a relation R, we define R' C R 
to a A-group of R if R' consists of all the tuples in R that share 
the same value(s) for attribute(s) in A; i.e., |IIa(.R')I = 1 and 
U A {R') n n A (R - R') = 0. Thus, R is a union of \n A (R)\ 
disjoint A-groups. 

Each window function wf j is represented by a pair (WPKj, WOK,), 
where WPK; is a set of partitioning key attributes and WOK; is a se- 
quence of ordering key attributes. For simplicity and without loss 
of generality, we assume that the attributes in WOK; are all ordered 
in ascending order. 

3. WINDOW FUNCTION EVALUATION 

In this section, we consider the evaluation of a single window 
function wf = (WPK, WOK) on a relation R. We first introduce a key 
concept termed segmented relation to characterize window function 
evaluation. Next, we present two new tuple reordering techniques, 
namely, Hashed Sort and Segmented Sort, to derive a segmented 
relation that matches a window function. We also present the cost 
models for these techniques and analyze their tradeoffs. Finally, we 
briefly describe how the execution of Hashed Sort and Segmented 
Sort can be parallelized for further performance improvement. 

3.1 Segmented Relation 

Definition 1 (Segmented Relation) Consider a relation R, where 
X is a subset of attr(R), the attributes in R, and Y is a se- 
quence of some attribute(s) in attr(R). We define R to be a seg- 
mented relation w.r.t X and Y, denoted by Rx,y, if R is ordered 
such that it is a sequence of k (> 1) disjoint, non-empty seg- 
ments, Ri , i?2 , ■ ■ ■ , Rk, that satisfy all the following properties: 

(1) I Ri = R; (2) the X values in each pair of segments are 

disjoint (i.e., Hx(Ri) n Hx(Rj) = 0, Vi, j G [1, k], i 7^ j); and 
(3) each segment is sorted on Y. 

Note that if X = 0, Rx,y is totally ordered on Y and consists of 
exactly one segment R; if X — and Y = e, Rx,y is unordered. 

In general, each segment of Rx.y consists of one or multiple 
X -groups of R. For the special case where each segment Ri of 
Rx,y consists of exactly one X-group (i.e., |IIx(.Ri)| = 1), we 
say that Rx.y is grouped on X and denote it by R 9 X Y . In R x Y , 

each segment is also ordered on every permutation X U attriY 1 ) 
that preserves the sequence of attr(Y). 

Definition 2 Given a segmented relation Rx.y and a window func- 
tion wf = (WPK, WOK), Rx,Y is said to match wf (or wf is matched 
by Rx,y) if X C WPK and there exists some permutation WPK of 
WPK such that WPK o WOK < Y. More generally, given a set of win- 
dow functions W, Rx,y is said to match W if Rx,y matches each 
wf G W. 

Example 2 Each of the segmented relations R®,( a ,b,c)> R{a},(b,a,c) 
and R{ b y r a c \ matches the window function wf = ({a, b}, (c)). □ 

A segmented relation that matches a window function has the 
following useful property. 

Theorem 1 If R matches a window function wf, then wf can be 
evaluated on R by a sequential scan of R without any reordering 
operation. 

We explain the intuition of Theorem 1 by considering the evalu- 
ation of wf = (WPK, WOK) on Rx.y. Since X C WPK, each segment 
Ri of Rxy consists of one or multiple WPK-groups. Furthermore, 
since each Ri is ordered on Y and there exists some permutation 



WPK such that WPK o WOK < Y, each Ri is necessarily also sort- 
ed on WPK o WOK and can be viewed as a concatenation of one or 
more WPK-groups. It follows that each WPK-group in Ri is ordered 
on WOK. Thus, wf can be evaluated by a sequential scan of the se- 
quence of ordered WPK-groups in Rxy- 

Based on Theorem 1, to evaluate a window function wf = (WPK, 
WOK) on R, it suffices to reorder R (if R does not match wf) to 
obtain a Rxy that matches wf and then sequentially scan Rxy- 
The most straightforward approach to achieve this reordering is the 
Full Sort (FS) technique which sorts R on WPK o WOK for some per- 
mutation WPK of WPK. The sorted result R', which is essentially 
i?0 jjj£ owaK , trivially matches wf . However, note that the total order- 
ing of R' on WPK o WOK is actually unnecessary for the purpose of 
computing wf , which only requires the input tuples to be partially 
sorted (i.e., a sequence of tuple partitions grouped on WPK and then 
each sorted on WOK). 

Before we present more efficient reordering techniques in the 
next two sections, we first introduce the notion of reorderability. 

Definition 3 (Reorderable) Given a window function wf , a rela- 
tion R, and a reordering technique O, we say that (7?,wf) is O- 
reorderable if R could be reordered by O such that the reordered 
relation matches wf . More generally, given a set of window func- 
tions W, we say that (R, W) is O-reorderable if (R, wf) is O- 
reorderable for each wf G W. 

3.2 Hashed Sort Technique 

Hashed Sort (HS) reorders R wrt wf in two steps: the first step 
partitions R into a collection of buckets by hashing 7? on some 
hash key, WHK C WPK, and the second step sorts each bucket Ri on 
a sort key WPK o WOK for some permutation WPK of WPK. Thus, HS 
essentially reorders R to obtain 7? WHK ^ oH0K which matches wf . In 
order that HS does not degenerate to FS, we require that WHK 7^ 0; 
thus, (R, wf ) is HS-reorderable if WPK / 0. 

Example 3 R can be reordered by HS to match wf = ({a, b}, (c)) 
if WHK is {a}, {b} or {a, b}, and WPK is {a, b) or (6, a). □ 

The details of HS are as follows. The first step sequentially scans 
R to build the buckets by hashing on WHK. HS tries to maintain as 
many buckets resident in the allocated main-memory as possible. 
Whenever the memory is full, HS picks a bucket Ri to be flushed 
to disk, and any subsequent tuple for Ri will be flushed to disk. At 
the end of the partitioning step, some of the buckets are resident in 
main-memory while the remaining ones are resident on disk. The 
second step will first sort the memory-resident buckets before the 
disk-resident ones. 

HS can be further optimized as follows. If statistics on the WHK 
values are available (e.g., histograms on a base relation R), it is 
possible to estimate the most frequent WHK values (MFVs), each 
of which corresponds to a set of tuples whose total size exceed- 
s the size of sorting memory. Tuples with such values belong a 
special bucket R x that will be immediately pipelined for sorting 
(i.e., without being cached in main memory or flushed to disk in 
contrast to other tuples). Therefore, R x is sorted before any other 
bucket. Such an optimization could save up to one pass of I/O for 
R x . Moreover, it is likely to result in a larger set of in-memory 
hashed buckets which can be sorted internally. 

3.3 Segmented Sort Technique 

Segmented Sort (SS) is designed to reorder a relation Rxy to 
match a window function wf = (WPK, WOK). As shown below, SS 
performs the reordering by separately sorting each segment/group 
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of Rx.y, whose size is generally much smaller than the entire re- 
lation. Thus, SS is usually much more efficient than FS and HS. 

(Rx,y, wf ) is SS-reorderable if one of the following conditions 
hold: either (1) X and X C WPK, or (2) X = and there 
exists some permutation Wpt of WPK such that (wk o WOK) A Y is 
non-empty. Specifically, SS reorders Rx.y to R x gjJ oWDK for some 

permutation WPK of WPK. Note that if X = 0, we need to choose 
a permutation WPK such that (WPK o WOK) A Y is non-empty; this 
permutation must exist by the applicability requirement of SS. As 
we shall explain later, the constraint imposed on WPK when X — 
is to ensure that SS does not degenerate to FS, which requires sort- 
ing the entire Rx,y- By Definition 2, it follows that the reordered 
relation R x ^ oW0K matches wf since X C WPK. 

We now explain how R x ^ oWDK is derived from Rx,y- Let 

q = (WPK o WOK) A Y. Thus, WpI o WOK = a o /3, where f3 is 
some sequence of attribute(s) 1 . There are two cases to consider de- 
pending on whether a is empty. Consider the general case where a 
is non-empty. Since a < Y, therefore each segment Ri of Rx,y, 
which is ordered on Y, is necessarily also ordered on a. Thus, 
each segment Ri is actually a sequence of a-groups. By separate- 
ly sorting each of these a-groups on /?, each segment Ri becomes 
sorted on a o /3 = WPK o WOK. Thus, we have reordered Rx,y to 

^X.WPKoWOK' 

Example 4 Consider using SS to reorder R to R' wrtwf = ({a,b}, 
(c)). If R = i?0,( a ,d), then a = (a) and R' = i?0,( o ,6, c )- If 

R = ft{a},(a,!>,d), tllen Q = (°J & ) and R ' = R {a},{a,b,c)- If R = 
R {b},(a,d)> then Q = 0> fc ) alld = R {bUa,cy D 

Consider the second case where a is empty (i.e., P = WPK o 
WOK). By separately sorting each segment Ri of Rx.y on /3, we 
have reordered Rx,y to i? x ^ oWDK - Note that by the applicability 
requirement of SS, a is empty necessarily implies that X 7^ 
as otherwise, we would have selected a permutatation of WPK that 
guarantees that a is non-empty. As we alluded earlier, the reason 
to avoid having both X and a being empty is to ensure that SS 
does not degenerate to FS: if X were empty (i.e., Rx,y consists 
of a single segment), then SS would essentially be performing a 
complete sort of the entire Rx,y on ft. 

Example 5 Consider two examples of using SS to reorder R to R' 
wrt wf = ({a, b}, (c)). If R = R{ a },(d) then R' = R{ a y,( a ,b,cy If 
R = R{b},(c) then R' = J?{b},( a ,i,, c )- □ 

Observe that there is generally more than one way for SS to re- 
order Rx,y to match wf depending on the choice of WPK. For the 
general case where a is non-empty, it makes sense to choose the 
permutation of WPK that maximizes the total number of distinct val- 
ues of a 2 so that the size of each a-group within each segment is 
minimized resulting in more efficient sorting and hence reordering. 

We remark that the partial sort operation [7, 13] is essentially 
an instance of SS, where the input relation is R@,y, the window 
function to match is wf = (0,WOK) and Y < WOK. Clearly, SS has 
much broader applicability than the partial sort. 

We conclude the discussion on SS by presenting a useful proper- 
ty (cf. [9] for the proof due to space limitation) for reasoning about 
SS-reorderability. 

Theorem 2 Letvii and wf 2 be two distinct window functions, and 
let R be a relation. 



'Note that since Rx,y does not match wf , a 7^ WPK o WOK. There- 
fore, /3 contains at least one attribute. 

2 Since Y is fixed for a given Rx,y, this translates to maximizing 
the number of attributes in a. 



1. If R matches wf 1 and R' is the output produced by evaluat- 
ing wfi on R, then (R,v±2) is SS-reorderable iff (R' , wf 2) 
is SS-reorderable. 

2. If(R, wf 1) is SS-reorderable and R' is produced by reorder- 
ing R with SS to match wf 1, then (R, wf 2) is SS-reorderable 
iff(R', wf 2) is SS-reorderable. 

Theorem 2 essentially states that the SS-reorderability property 
of a relation R (wrt some window function wf 2) is preserved by 
two types of transformation of R to R': (1) evaluating some win- 
dow function wf 1 on R to obtain R', and (2) reordering R (wrt 
some window function wf 1) using SS to obtain R'. By preserva- 
tion, we mean that (ii, wfa) is SS-reorderable iff (i?',wf2) is SS- 
reorderable. This property will be used in our optimization frame- 
work in Section 4. 

3.4 Cost Models and Analysis 

In this section, we present cost models for reordering a relation 
R (of the form Rx,y) to match wf = (WPK, WOK) using the re- 
ordering operators FS, HS and SS. Since a reordering operator may 
pipeline its output (to another operator) while the reordering is still 
in progress, our cost models include the cost of outputting the re- 
ordered relation but exclude the cost of reading the input relation. 

We use Cost(R, O) to denote the cost of reordering R using 
operator O, M to denote the allocated main memory (in number 
of blocks) for the operation, and B(Ri) to denote the size of a 
relation/segment/group Ri (in number of blocks). Let k denote the 
number of segments in Rx.y- 

FS is based on the standard, external merge-sort algorithm con- 
sisting of two phases: an initial run formation phase that creates 
sorted subsets, called runs, and a merge phase that merges runs into 
larger runs iteratively, until a single run is created. Assuming that 
replacement selection is used to create the initial sorted runs, the 
size of each initial run is 2M blocks. The sorted runs are merged 
using the well-known F-way merge pattern, where F is the merge 
order (i.e., number of runs that can be simultaneously merged using 
M). Therefore, 

Cost(R, FS) = 2 X B(R) X (\log F (^-)} + 1) (1) 

For HS, we assume that the values of WHK follow a uniform dis- 
tribution. If the number of distinct values of WHK in R, denoted 
by D(WHK), is large enough, we expect that each generated hashed 
bucket will be small enough to fit into main memory and thus, it 
can be internally sorted; otherwise, if D(WHK) is very small, the 
hashed buckets may require external sortings. As such, we estimate 
the total number of generated hashed buckets as iV = D(WHK). 
Therefore, B(Ri) is estimated as B(R) /N for each hashed bucket 
Ri. The number of hashed buckets that are never flushed to disk is 
N' = [M x N/B(R)\. Therefore, 

Cost(R, HS) = 2 x B(R) x (1 ) + Cost{R t ) (2) 

i — 1 

where Cost(Ri) denotes the cost of (internally or externally) sort- 
ing the i th hashed bucket Ri. Since the sortings in HS incur possi- 
bly less I/O cost (due to possibly fewer run merge passes) than FS, 
J2iLi Cost(Ri) is expected to be lower than Cost(R, FS). As 
such, Cost(R, HS) is lower than Cost(R, FS) when the value of 
Cost(R, FS) - J2i=i Cost{Ri) is large enough, i.e., when M is 
small. Thus, we expect that HS is generally comparable to FS, but 
HS will outperform FS when M is small. 

For SS, recall from Section 3.3 that SS reorders by indepen- 
dently sorting either segments of Rx,y if a is empty; or a-groups 
within each segment of R, otherwise. For convenience, we refer 
to each segment/group being sorted as a unit. To model the sorting 
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cost, we need to estimate the number and size of the units. Let u de- 
note the number of units in each segment of R. We assume that the 
attributes of R follow the uniform distribution and are uncorrelated 
with each other. As such, for each segment Ri, B(Ri) = B(R)/k. 

There are two cases to consider depending on whether a is emp- 
ty. For the case where a is empty, each unit is a segment and u = 1. 

We now consider the case where a is non-empty. Note that each 
segment contains a proper subset of the distinct X values in R. If 
a segment is large enough, we assume that it contains all of (resp. 
1/fc of) the distinct a values in R when attr(a)f~)X is empty (resp. 
non-empty); otherwise, we assume that each tuple in the segment 
has a distinct a value. Therefore, 

f min{T{R)/k,D{a)) if attr(a) n X = 
U ~ \ min(T(R)/k,D(a)/k) otherwise. 

where T(R) denotes the number of tuples in R and D(a) denotes 
the number of distinct values of a in R. Thus, R contains a total 
of k * u units, each of which has a size of B(R) /{k * u) blocks. 
Therefore, 

K*U 

Cost{R, SS) = ^2 Cost(Ui) (3) 

i=l 

where Cost(Ui) denotes the cost of (internally or externally) sort- 
ing a unit Ui. 

SS can be very efficient without incurring much or any I/O over- 
head, especially when the units to be sorted are small. Comparing 
Eqs. 2 and 3, SS is at least no more expensive than HS. More- 
over, compared with FS, SS has significantly fewer number of tu- 
ple comparisons without incurring extra I/O cost: the complexity 
of independently sorting k segments each of n/k tuples is 0(k * 
n/klog(n/k)) = 0(nlog(n/k)) compared to a complexity of 
0(nlog(n)) for a single sort of all n tuples. Thus, the cost of SS 
is expected to be generally lower than FS and HS. 

3.5 Parallel Execution 

The evaluation of a window function wf = (WPK, WOK) on a 
relation R can be easily parallelized by partitioning the tuples of 
R by either hash or range partitioning on the WPK attributes. The 
window function wf can then be evaluated in parallel on each data 
partition. It is easy to see that if (i?,wf) is SS-reorderable (resp. 
HS-reorderable), then each data partition can also be processed by 
reordering its tuples with an appropriate SS (resp. HS) operation. 

4. OPTIMIZATION OF MULTIPLE WIND- 
OW FUNCTION EVALUATIONS 

In this section, we consider the general problem of evaluating a 
set of window functions in a query. Specifically, the problem is to 
optimize the evaluation of a set of window functions W = {wf i, 
■ ■ ■ , wf „} on a relation R, where each wf i — (WPK^, WOK;) and R is 
of the form Rx,y for some set of attributes X and some sequence 
of attributes Y. Recall that if R is an unordered relation, then X is 
an empty set and Y is an empty sequence. 

4.1 Evaluation Model 

The window functions in W are evaluated sequentially based on 
some ordering of the window functions. Let (wf i, • • • , wf n ) denote 
the chosen evaluation order, and let Ij and Oj denote, respectively, 
the input and output relations of the evaluation of each wf j £ W. 
Each window function wf i is evaluated by the following two steps. 
First, if Ij does not match wf j, then reorder to I'j using an appli- 
cable reordering technique (i.e., FS/HS/SS) such that I'j matches 
wf j. For convenience, let I'j denote Ij if there is no reordering. 
Second, sequentially scan I'j to compute wf j. Note that Ij is the 
original relation Rx,y if j = 1; otherwise, Ij is Oj-i. 



The above sequential evaluation model is implemented in sever- 
al database systems including DB2, Oracle, SQL Server 3 and Post- 
greSQL. 

To optimize the evaluation of W, we need to choose an evalua- 
tion order of the window functions and choose a reordering tech- 
nique for each window function that is not matched by its input re- 
lation. The following result establishes that this optimization prob- 
lem is NP-hard. 

Theorem 3 The problem of finding the lowest-cost evaluation plan 
for an input set of window functions is NP-hard. 

The proof is established by reducing the Travelling Salesman 
Problem [14] to a special case of the problem, where a compul- 
sive FS will be used to reorder the input of every window function 
(cf. [9] for the complete proof). 

4.2 Overview of Our Approach 

Given the NP-hardness of optimizing a set of window function 
evaluations, in this paper, we present an efficient heuristic to solve 
the problem. Our approach optimizes the evaluation of W by min- 
imizing two key aspects: (1) the number of reorder operations, and 
(2) the usage of FS and HS (which are generally less efficient than 
SS) for reordering. 

Note that as each window function evaluation computes an addi- 
tional column to store the derived values for some analytic function, 
the size of the input relation for each window function evaluation 
actually becomes larger as the evaluation progresses. However, for 
tractability reasons, our optimization framework makes a simplify- 
ing assumption that the size of the input and output relations for 
each window function evaluation are the same. We refer to this as 
relation size assumption. As the number of window functions is not 
too many, the additional columns introduced by the window func- 
tion evaluations are relatively small compared to the tuple size. In 
Section 4.6, we discuss how our approach can be further optimized 
to mitigate this assumption. As we shall see in the experimental 
results, our optimization framework is effective even with this sim- 
plifying assumption. 

The following two examples illustrate the intuitions for our opti- 
mization framework. 

Example 6 Consider the evaluation of W — {wf i = ({a}, (6)), 
wf 2 = ({a}, e} on an input relation i?0 i£ , which matches none of 
wf i and wf 2. If we first reorder R into R@,(a,b) for evaluating wf i, 
then the output of wf i directly matches wf 2. In contrast, if we first 
reorder R into i?0,( a ) for evaluating wf 2, then we need an extra SS 
operation to reorder the output of wf 2 for evaluating wf 1. □ 

Example 7 Consider the evaluation of W = {wf 1 = ({o, b},e), 
wf 2 = ({a}, (c))} on an input relation i?0 j£ , which matches none 
of wf 1 and wf2. Suppose we first reorder R with a FS operation 
for evaluating wf 1. If R is reordered to ^,(0,6)! then we just need 
a SS operation to reorder the output of wf 1 for evaluating wf 2. On 
the other hand, if R is reordered to Rtj,(b, a ), then we need a more 
expensive FS/HS operation to reorder the output of wf 1 for wf 2. □ 

As illustrated by Example 6, a useful strategy to reduce the num- 
ber of reorder operations is to identify a subset Wi of window func- 
tions that can be matched by a common segmented relation Ri . The 
idea is that instead of incurring possibly one reorder operation to e- 
valuate each window function in Wi, we can just perform a single 
reordering of the input relation to derive Ri which can then be used 

3 For the commerical DBMS, our conclusions are drawn from view- 
ing the physical plans of our test queries. We note that Oracle also 
supports parallel evaluation of a single window function [4, 5]. 
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to evaluate W%. In the following, we formalize the required proper- 
ties for the above evaluation idea. 

Theorem 4 Let Rbea relation on which a set of window functions 
W are evaluated. If R matches W, then for any evaluation order 
(wf i, ■ ■ • , wf „) of the window functions in W, the output relation 
Oi (produced by the evaluation of wf ^ on U) matches W for each 
wf i G W. 

Corollary 1 If R matches a set of window functions W, then for 
any evaluation order ofW, W can be evaluated on R without any 
reordering operation. 

Thus, by Corollary 1 (which is a generalization of Theorem 1), 
if a set of window functions W is not matched by a relation R and 
it is possible to reorder R to R' so that R' matches W, then we can 
evaluate W on R by a single reordering operation. Specifically, for 
any evaluation order (wf i, • • • , wf n ) of W , the evaluation of wf i 
requires a reordering of R to R'\ subsequently, the evaluation of 
each wf i, i > 1, does not require any reordering. 

We next characterize an important property for a relation to match 
a set of window functions. 

Definition 4 A set of window functions W is defined to be a cov- 
er set if there exists a permutation WPKi o/WPK; for each window 
function wf , G W and a window function wf c G W such that 
WPKt o WOKi < WPKc o VOKcfor each wf t G W - {wf c }. The win- 
dow function wf c is defined to be a covering window function ofW, 
and WPK C o WDK C is defined to be a covering permutation of wf c . 

Example 8 Consider W — {wf i, wf 2, wf 3}, where wf 1 = ({a,b, 
c}, (d)), wf 2 = ({a, 6}, (c, d)), and wf 3 = ({a, b}, (c)). W is a 
cover set with two covering window functions wf 1 and wf 2. □ 

Theorem 5 If a relation R matches a set of window functions W, 
then W is a cover set. 

Theorem 5 (cf. [9] for the proof) suggests the following idea to 
optimize the evaluation of W . Let W be partitioned into a collec- 
tion of cover sets, W = Co U ■ • ■ U C\, such that each d is 
evaluated before d+i. 

For each cover set d, if the input relation to d either matches 
the first window function in d or can be reordered to some relation 
that matches the first window in d, then it follows that each d can 
be evaluated by at most one reordering operation, and therefore, 
W can be evaluated using at most (k + 1) reorderings. Thus, by 
minimizing the number of cover sets in the partitioning of W, the 
number of reorder operations to evaluate W can be minimized. 

Our evaluation approach builds on the above idea to evaluate W 
as a sequence of cover set evaluations, where each cover set eval- 
uation is a sequence of window function evaluations. We elaborate 
on this cover set-based evaluation strategy in the next section. 

4.3 Cover Set-based Evaluation 

Before we present our approach, we shall introduce some nota- 
tions. For each cover set d, let Z; and d denote, respectively, 
the input and output relations of the evaluation of d; i.e, Xj, is the 
input relation to the first window function in d, and Oi is the out- 
put relation produced by the last window function in d. Let wf * 
denote the first window function that is evaluated in cover set d. 

To minimize both the number of reorder operations as well as 
the number of reorderings performed using FS/HS, our approach 
partitions W into three disjoint subsets, W = Co U d U C2, such 
that Co is evaluated first, followed by Ct, and finally C2. Note that 
each d could possibly be empty. 



Co is the set of window functions in W that are matched by 
the input relation Rx,y- Thus, Co is necessarily a cover set (by 
Theorem 5), and by Corollary 1, each window function in Co can 
be evaluated without any reordering. 

Since Rx,y does not match any wf ; G W — Co, the remaining 
set of window functions (i.e., Cl U C2) requires at least one re- 
ordering to be evaluated. To minimize the usage of FS/HS reorder- 
ings, Ci is defined to contain all the window functions in W — Co 
such that (Rx,y, Ci) is SS-reorderable. By Theorem 2, (Co, Ci) 
is necessarily also SS-reorderable, where Co is the output relation 
produced by the evaluation of Co. Thus, Ci can be evaluated us- 
ing only SS reorderings (i.e., FS/HS reorderings can be avoided). 
To minimize the number of SS reorderings to evaluate Ci, our ap- 
proach further partitions Cl into a minimum number of disjoint 
cover sets, Cl = Ci,i U ■■■ U Ci, mi . The details are explained 
in Section 4.4. 

The evaluation of C2, however, is more intricate as it requires 
at least one FS/HS reordering and zero or more SS reorderings. 
To minimize the number of reorderings to evaluate C2, C2 is also 
evaluated in terms of a collection of cover sets, C2 = (C2,i U 
■■■ U C 2 ,m 2 ) U ••• U (C fc ,i U •■• U C k ,m, k ). The details are 
explained in Section 4.5. 

The overall organization of our cover set-based evaluation is de- 
picted in Figure 2. Each circle in Figure 2 represents a single win- 
dow function evaluation, where the color indicates the technique 
used for reordering (if any): white means that there is no reorder- 
ing, and gray (black, resp.) means that the input relation is re- 
ordered using SS (FS or HS, resp.). The window functions within 
each box represent a cover set, and the chain of window function 
evaluations are connected by the directed edges. Except for Co, 
which is evaluated without any reordering, each of the cover sets 
dj requires exactly one reordering for its evaluation, and the re- 
ordering is performed as part of the evaluation of the first window 
function. 

Co Cl,l Cl,mi 

|o- -oi — ■ !•—:> ■ <~\ ■■■ H»~- > ; : ~<^ =- 




<=* \9-*o* ■■■ *o\ — 40— o> ■ ■ ■ K)f ■■■ H<»— o>_-Tq1 ^ 
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Figure 2: Cover set-based evaluation approach 

In the following, we discuss the optimization of the evaluations 
of Ci and C%. 

4.4 Evaluation of Ci 

As explained in the previous section, (Co, Ci) is SS-reorderable, 
and the number of required SS reorderings to evaluate Cl is min- 
imized by partitioning Cl into a minumum number of cover sets. 
However, the problem of finding such an optimal partitioning of Cl 
is NP-hard as the following result shows. 

Theorem 6 The problem of partitioning a set of window functions 
W into a minimum number of cover sets is NP-hard. 

The proof is established by a reduction from the Minimum Vertex 
Coloring Problem [14] (cf. [9] for the complete proof). 

As such, the partitioning of Cl can be solved using an efficient 
heuristic (e.g., Brelaz's heuristic algorithm [6] for the minimum 
vertex coloring problem). 
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Assume that Ci has been partitioned into mi cover sets, C\ = 
Ci,i U ■ ■ ■ U Ci, mi , and the evaluation order is Ci,i, ■ • ■ , Ci, mi . 
Thus, = Co. Since (Oo,Ci) is SS-reorderable, it follows 
from Theorem 2 that for each Cij, , Cij) is SS-reorderable. 

Each Ci can be evaluated on Xij using a single SS-reordering 
based on the following result (cf. [9] for the proof). 

Theorem 7 Consider the evaluation of a set of window functions 
W on a relation Rx,y, where W is a cover set and (R, W) is SS- 
reorderable. Let Rx,y' be the output relation produced by a SS 
reordering of R wrt a covering function wf c ofW such that Y' is 
a covering permutation of wf c . Then Rx,y' matches W. 

Let Xij be of the form Rx.y- To apply Theorem 7 to evaluate 
Cij onlij, we choose a covering function of Cij to be wf * the 
first window function to be evaluated in Cij. Since 2a, j does not 
match wf ij but (Iij, wf*^) is SS-reorderable, we reorder X\j to 
I[j using SS wrt wf *j such that X[j is of the form R x> y', where 
Y' is a covering permutation of wf *j . By Theorem 7, X[ j matches 
Cij , and therefore each wf i G Cij can be evaluated without any 
further reordering. 

4.5 Evaluation of C 2 

By definition of Ci, for each wf « G C2, wf j is not matched by 
Rx,y and (Px,Y,wfi) is not SS-reorderable. Furthermore, it fol- 
lows from Theorem 2 that (d, mi , wf j) is also not SS-reorderable, 
where 0i, mi is the output relation produced by the evaluation of 
Ci. Therefore, the evaluation of C2 requires at least one reordering 
using FS/HS. 

To minimize the number of FS/HS reorderings to evaluate C 2 , 
we partition C2 into a minimum number of partitions, C2 = P2 U 
■ ■ ■ U Pfe 4 , such that each Pi can be evaluated with exactly one 
FS/HS reordering and zero or more SS reorderings. To minimize 
the number of SS reorderings required for evaluating each Pi, we 
further partition each Pi into a minimum collection of cover sets, 

Pi = Ci.l U ■ ■ ■ U Ci^rni- 

The collection of cover sets in C2 are evaluated in the follow- 
ing order: each P, is evaluated before P;+i, and within each Pi, 
each Cij is evaluated before dj+i. The entire order of cover set 
evaluations is shown in Figure 2. 

Note that within each Pi, it is necessary for the first cover set 
Ci,i to be reordered using FS/HS. This is a consequence of the fact 
that (Px,r,wf) is not SS-reorderable for each wf £ C2. Thus, 
within each Pi, d.i is reordered using FS/HS, while each of the 
remaining cover sets dj ,j> 1, in Pj is reordered using SS. 

For the above evaluation strategy for C2 to be feasible, it is nec- 
essary that Ci t j) is SS-reorderable for each Pi in C 2 and for 
each j G [2, mj, so that each of the cover sets in Pi (except for 
the first) can be reordered using SS. The following result states the 
required property for this strategy to work. 

Definition 5 (Prefixable) A set of window functions W = {wf 1, 
• ■ • , wf „} is defined to be prefixable if for each wf j G W, there 
exists a permutation WPKi ofWPKi such that /\™ =1 (WPKi o WOK;) is 
non-empty. 

Theorem 8 Let a set of window functions W be evaluated on a 
relation R, where for each wf ; G W, R does not match wf ; and (R, 
yfi) is not SS-reorderable. W can be evaluated with one FS/HS 
reordering and zero or more SS reorderings iffW is prefixable. 

4 For notational convenience, we label the partitions of C2 to start 
from P2 instead of Pi. As each Pi is further partitioned into cover 
sets Cij, this ensures that the cover sets of C2 are distinctly labeled 
from those of C\. 



Based on Theorem 8 (cf. [9] for the proof), our evaluation s- 
trategy for C2 requires that each Pi be prefixable. However, the 
problem of finding such an optimal partitioning of C2 is NP-hard 
as the following result shows. 

Theorem 9 The problem of partitioning a set of window functions 
W into a minimum number of prefixable, disjoint subsets is NP- 
hard. 

The proof is established by reducing the Minimum Set Cover prob- 
lem [14] to a special case of the problem (cf. [9] for the complete 
proof). The partitioning problem can be solved using a greedy 
heuristic that tries to minimize the number of prefixable subsets 
by maximizing the number of window functions in each prefixable 
subset under construction. The details of the heuristic are given 
elsewhere [9] and it has OdW 7 ! 2 ) time-complexity. The effective- 
ness of the heuristic has been validated by our experimental results 
in Section 6.2, where it succeeded in finding the optimal partition- 
ing of C2 for all tested window queries. 

Assume that C2 has been partitioned into k prefixable subsets, 
C2 = P2 U • • • U Pfc, and each Pi has been partitioned into m; 
cover sets, Pi = d,i U ■ ■ ■ U d tmi as discussed. 

Each Pi is processed using two main steps. In the first step, we 
reorder I^i (wrt wf * x in Ci,i) to X' ix such that the following two 
properties are satisfied: (1) d t i can be evaluated with exactly one 
reordering (using FS/HS); and (2) each of the remaining dj in Pi 
can be evaluated with exactly one reordering (using SS). 

For this reordering operation, if both FS and HS are applicable, 
the choice of which technique to apply is determined in a cost- 
based manner. We discuss these two cases of reordering in the 
following two subsections. 

The second step evaluates each dj as follows. By Property 1, 
we use X' i ± to evaluate d.i without any further reorderings. By 
Property 2, we use X' itl to evaluate Pi — {C;,i} following the same 
procedure as evaluating C\ using Co- 
in the ensuing discussion, we will present the details of the first 
step of reordering Xi.i using FS/HS. 

4. 5. 1 Reordering with fs 

We first discuss how to reorder Xi.i (wrt wf * x ) to l| 1 using 
FS. The main task is to determine the sort key for FS such that it 
satisfies two properties: (\)X' il matches C;,i, and (2) (Z^' ^wf*.,) 
is SS-reorderable for each j G [2,771;]. Property 1 ensures that 
C;,i can be evaluated using exactly one reordering with FS, and 
Property 2 ensures that each of the remaining dj in Pi can be 
evaluated using exactly one reordering with SS. 

The sort key for FS is derived as follows. We choose a covering 
function of d.i to be wf * j. Let 9 (Pi) denote the longest intersec- 
tion among /\ uf gp (WPKj o WOKj) for every permutation WPKj of 

each WPKj . Since Pi is prefixable, 9(Pi) is a sequence consisting of 
at least one attribute. Note that 9(Pt) might not be unique. As an 
example, in Example 8, 9(W) could be abc or bac. By definition 
of 9 (Pi), for each wf j G Pi, there exists a permutation WPKj of 
WPKj such that 9 (Pi) < VPk) o WOKj. 

Let 7 denote a covering permutation of wf * a such that 9(Pi) < 
7. Since Pi is prefixable and wf * a is a covering function of Ci,i, 
7 must exist. It follows that if we use 7 as the sort key for FS to 
reorder I^i, Property 1 is guaranteed by the fact that 7 is a covering 
permutation of wf| 1( and Property 2 is guaranteed by the fact that 

0(P) < 7 5 - 

5 Note that requiring 9(Pi) < 7 is actually sufficient but not nec- 
essary for Property 2. Specifically, so long as 9' < 7, where 9' is 
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4. 5. 2 Reordering with hs 

We next discuss how to reorder X^i (wrt wf * x ) to 1 using 
HS. Recall that HS is applicable if WPK;,i ^ 0. Similar to FS, for 
HS, we need to choose the hash key, WHK, and sort key such that 
Properties 1 and 2 are satisfied. 

Let 9' be the maximum prefix of 0(Pi) such that attr(6') C 
WPKj for each wf j £ C;,i. To satisfy Properties 1 and 2, it suffices 
to choose any subset of 8' for WHK. The selection of the sort key 
follows the same approach discussed in Section 4.5.1 for FS. 

4.6 Further Optimization 

In this section, we discuss some evaluation order issues relat- 
ed to our framework and discuss how our approach can be further 
optimized to mitigate the relation size assumption (Section 4.2). 

Based on the preceding discussion, the evaluation plans pro- 
duced by our evaluation framework (see Figure 2) are actually only 
partially ordered in that the evaluation order of some of the cov- 
er sets as well as window functions within a cover set could be 
reshuffled (without affecting correctness) for further optimization. 
Specifically, the following cover sets (csets) or window functions 
(wfs) can be reshuffled for further optimization: ( 1 ) the wfs within 
Co, (2) the csets of Ci, (3) the P 4 's of C 2 , (4) the csets of each Pi 
in Ci, (5) the choice of covering functions for the first wf within 
each cset (except Co), and (6) the non-first wfs within each Ci,j, 
i e [l,k],j € [l,m;]. 

To address the relation size assumption, one reasonable heuristic 
to reshuffle the above wfs/csetsAp's (refer to as units) is to order 
the units in increasing size of the extra column(s) produced by their 
evaluations so that the negative effect of a unit that is producing 
larger additional columns is deferred to a later stage of the execu- 
tion chain. We intend to explore these further optimizations as part 
of our future work. 

5. INTEGRATED WINDOW QUERY OPTI- 
MIZATION 

In this section, we present two approaches to integrate the opti- 
mization framework presented in the previous section into the over- 
all query optimization process. 

A window query WQ is essentially a conventional SQL statemen- 
t Q augmented with a set of window functions W defined in the 
SELECT clause of Q. A loose integration approach to optimize WQ 
is to decompose the optimization task into a sequence of three op- 
timization sub-tasks. First, optimize Q (except for the DISTINCT 
and ORDER BY clauses) to produce a windowed table WT. Sec- 
ond, optimize the evaluation of W on WT using the optimization 
framework in the previous section to produce an output table WT'. 
Finally, optimize the evaluation of the remaining DISTINCT and 
ORDER BY clauses on WT'. 

While the loose integration approach offers a straightforward 
way to incorporate the window function optimization framework 
into a query optimizer, optimizing WQ as three separate sub-tasks 
can produce final query plans that are sub-optimal. For example, 
it is possible for a sub-optimal plan for the second sub-task to pro- 
duce WT' ordered in an "interesting" order, which could lead to a 
less costly plan for the final sub-task and thus lead to an overall 
cheaper query plan. 

This drawback can be alleviated by adopting a more tightly inte- 
grated approach to optimize WQ. Based on W and Q, we identify 

a non-empty prefix of 9(Pi), Property 2 is guaranteed. However, 
the stronger requirement that we presented is beneficial for per- 
formance reason as the subsequent SS reorderings could be more 
efficient from sorting smaller segments. 



a set IP of interesting order [16] and/or interesting grouping [15, 
18] properties. Intuitively, IP consists of potential properties of WT 
that could benefit the derivation of WT' from WT. For example, 
suppose Q contains a GROUP BY clause with a set of grouping 
attributes gpk. Then an interesting (order or grouping) property 
for WT would be for WT to be a segmented relation WT^ e or 
WT 'g that would lead to a non-empty Co U Ci for W. For each 
interesting property ip in IP, the query optimizer will generate an 
optimal subplan to generate the windowed table WTi P that is as- 
sociated with ip. In addition, the optimizer will also generate the 
optimal plan to produce an arbitrary windowed table WT without 
taking into account of any interesting property in IP. Corresponding 
to each WTi p (or WT ), we derive the optimal window function 
chain C for evaluating W on WTi P (or WT ). Furthermore, by 
reshuffling the Pi's of C2 in C (or the cover sets of Ci if C2 is 
empty), we also try to derive from C the cheapest chain C' that 
will result in a WT' ip (or WT' a ) (partially) satisfying the ordering 
requirement of the ORDER BY clause, so that an explicit sorting 
of WT' ip (or WT' ) could be avoided or a cheaper partial sorting 
could be applicable. In this way, by taking into account of the in- 
teresting properties to enlarge the plan search space for WQ, the 
optimal query plan will not be missed. 

6. PERFORMANCE STUDY 

We validated our ideas using a prototype built in PostgreSQL 
9.1.0 [1]. In our implementation, both the Hashed Sort (HS) and 
Segmented Sort (SS) are integrated into PostgreSQL as standard 
execution operators. Moreover, we modified the optimizer of Post- 
greSQL to support a total of four distinct optimization schemes, all 
of which generate window function chains as query plans: 
CSO: Our proposed cover set-based optimization scheme. 
BFO: The brute-force scheme, which enumerates and compares 
for a window query all the feasible execution plans that are 
based on FS, HS and SS. 
ORCL: The scheme adopted by Oracle 8i [5], It tries to cluster the 
window functions of a query into a minimum set of Order- 
ing Groups (OG) which are equivalent to our notion of cover 
sets. However, the leading window function of each OG is 
only FS-reorderable. 
PSQL: The naive scheme adopted by PostgreSQL 9.10, where the 
window functions of a query are evaluated strictly following 
their input sequence in the SELECT clause and each window 
function is only FS-reorderable. For a window function wf , 
the WPK in the sort key (WPK o WOK) of FS is exactly the in- 
put sequence of WPK attributes in the SELECT clause. The 
only optimization applied by PSQL is that, when a window 
function is matched by its input, the FS for it is omitted. 
All experiments were performed on a Dell workstation with a 64- 
bit Intel Xeon X5355 2.66GHz processor, 4GB memory, one 500G- 
B SATA disk and another 1TB SATA disk, running Linux 2.6.22. 
Both the operating system and PostgreSQL system are built on the 
500GB disk, while the databases are stored on the 1TB disk. 

6.1 Micro-benchmark Test on FS, HS, and SS 

In this experiment, we used a micro-benchmark test to compare 
the performance of FS, HS and SS, under various situations. To 
this end, we defined a window query template Q: 
SELECT * , rank ( ) OVER 

(PARTITION BY AttrSet ORDER BY AttrSeq) 

FROM T 

Q essentially evaluates a window rank ( ) function with WPK = 
AttrSet and WOK = AttrSeq over a windowed table T, where 
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Figure 3: Micro-benchmark test, part 1: FS vs. HS 



AttrSet, AttrSeq and T are all configurable. Another experi- 
mental parameter that we varied is the available operating memory 
dedicated to each tuple reordering operation, referred to as the li- 
nk reorder memory and denoted by At , whose values ranged from 
10MB to IOOOMb'. 

The execution of Q invokes a single tuple reordering operation 
for rank ( ) . For comparison, we directly measure the plan execu- 
tion cost of Q, which consists of the cost of tuple reordering as well 
as the cost of the subsequent window function calls which usually 
remains constant. 



Part 


Query 


T 


AttrSet 


AttrSeq 


1 


Ql 


wssales 


{ws_item_sk} 


(wssoldjimesk) 


Q2 


\ws_item_sk, 
ws -bill -customer _sk} 


Q3 


\ws -ware h oa sesk} 


2 


Q4 


wssaless 


{ws-quantity} 


[wsStem.sk) 


05 


ws_sales_g 



Table 1: Queries used in the micro-benchmark test 

We split the test into two parts. In the first part, T was the 
websales relation from the TPC-DS [2] benchmark. We generat- 
ed websales by using the official generator provided by TPC-DS 
and a scale factor of 100. The generated table had a total size of 
14.3GB and contained 72 million tuples, each of which had the av- 
erage tuple size of 214 bytes. All attributes in web.sales followed 
a uniform distribution. Since web.sales was totally unordered, the 
SS operation was inapplicable to reorder it. Thus, in this part we 
only compared the performance of FS and HS. We instantiated Q 
with three concrete queries, Ql, Q2 and Q3 (shown in Table 1), 
which represent three kinds of situations where the number of win- 
dow partitions are medium (204000), extremely large (71976736) 
and extremely small (16) respectively. 

The experimental results are depicted in Fig. 3, from which we 
have several observations. First, while the performance of FS was 
sensitive to M smaller than 150MB, relatively the performance of 
HS was very stable regardless of M. This is because the number of 
run merge passes in FS decreased from 6 to 1 as M increased from 
10MB to 150MB, while the sortings of hashed buckets in HS were 
either internal or external with a single run merge pass. Second, HS 
had huge (resp. decent) performance gains over FS when M was s- 
maller than 50MB (resp. between 50MB and 100MB), and lost out 
to FS when M was larger than 100M B. The reason for the per- 
formance loss of HS is that when M < 150MB, FS incurred just 
one pass of table I/O, but HS always incurred more than one pass 
of table I/O due to the table partitioning phase. However, we notice 
that in many situations, the performance loss of HS was negligible 
or insignificant. Third, we took a closer look into Q3, where each 
of the 16 window partitions had a large size of 900MB. Since we 
did not implement the optimization for HS, for each hashed bucket 
of HS in Q3, it always contained more than one window partition 



and thus had to be spilled to disk even when M reached 1000M- 
B. Thus, when M increased, the I/O performance of HS did not 
improve, while the total CPU cost for tuple comparisons became 
higher and higher. This explains the performance reduction of HS 
along with the increased M as shown in Fig. 3(c). 

As a summary, HS is expected to outperform FS when M is not 
very large. Moreover, another advantage of HS over FS is its per- 
formance stability under a very wide range of Ms. On the other 
hand, a potential drawback of HS is that, unlike FS, its output does 
not have a total ordering, which may benefit the next stage's oper- 
ations, e.g. an order by. 

In the second part of this test, we compared the performance 
of SS with FS and HS. We generated two different instances of 
T: websaless and web.sales.g, both of which were manually 
reordered from the websales table used in the first part of this 
test, websaless (resp. web.sales.g) are sorted (resp. grouped) 
on attribute w s. quantity . As such, we instantiated Q with two 
concrete queries, Q4 and Q5 (shown in Table 1), for which the 
SS operation is applicable. Note that in both Q4 and Q5, SS will 
separately sort each ws.quantity-gmup on ws-item.sk. 
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Figure 4: Micro-benchmark test, part 2: SS vs. FS and HS 

As shown in Fig. 4, SS outperformed both FS and HS by a large 
margin in all situations. These experimental results are consistent 
with the prediction of the cost models. 

6.2 Evaluation of Window Queries 

In this experiment, we measured the effectiveness of our cover 
set-based window function optimization scheme presented in Sec- 
tion 4. To this end, we generated a list of window queries, each of 
which has the form of 

SELECT *, W FROM websales 
where W represented a set of window rank ( ) functions and web- 
_sales (or ids for short) was the TPC-DS table utilized in the first 
part of the above micro-benchmark test. The set of attributes of 
wssales mentioned by the tested window queries are listed in Ta- 
ble 2. For convenience, in the rest of this section we will refer to 
these attributes using their corresponding abbreviations. 

The tested window queries were Q6, Q7, QS and Q9, whose em- 
bedded window functions are presented in Tables 3, 5, 7 and 9, re- 
spectively. Note that within each query, for two window functions 
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Attribute (abbr.) 


Attribute (abbr.) 


wa.sold.date.sk {date) 


ws.item.sk (item) 


ws.sold.time.sk {time) 


ws.bill.customer.sk ( bill ) 


ws.ship.date.sk (ship) 





Table 2: Attributes of wssales involved in the tested window 
queries, as well as their abbreviations 

wf i and wf j, if i < j, then wf ; preceded wf j in the SELECT clause. 
The optimized execution plans, i.e., window function chains, for 
the four tested window queries are shown in Tables 4, 6, 8 and 10, 



respectively. In a chain, wfi/i 



wfj (resp. wf;/uis — > wf.,) 



means that the table web.sales or the output of wf ; matches (resp. 
needs to be reordered by X — FS/HS/SS for) wf j. 

We chose three values, 50MB, 75MB and 150MB, for the unit 
reorder memory M allocated for each tuple reordering operation in 
a query plan. There are two reasons for 150MB being the maximum 
testing memory size. First, according to Fig. 3, in this experiment 
neither HS nor FS will strictly outperform the other. In so doing, we 
intended to examine the accuracy of our cost models for HS and FS 
proposed in Section 3.4. Second, it is observable from Fig. 3 and 
Fig.4 that, compared with 150MB unit reorder memory, a larger 
memory size will make little difference on the performance of HS, 
FS and SS and thus will not invalidate the conclusions reached 
below. 

We then compare the performance of CSO, WF, ORCL and 
PSQL according to each tested window query. 





WPK WOK 




WPK WOK 


wf 1 


{item} (date) 


wf 2 


{item} {bill) 



Table 3: Window functions contained by Q6 



Scheme 


M (in MB) 


Plan 


BFO/CSO 


50/75 


WS > Wf 1 > Wf 2 


150 


WS > Wf 1 y Wf 2 


CSO(vl) 


50/75/150 


CSO(v2) 


50/75 


HS HS 
WS > Wf 1 V Wf2 


150 


FS FS 
WS > Wf 1 > Wf2 


ORCL/PSQL 


50/75/150 



Table 4: Execution plans for Q6 




75 150 
Unit Reorder Memory (in MB) 

Figure 5: Evaluating different optimization schemes with Q6 

For Q6, the execution plans generated by CSO (resp. PSQL) 
are exactly the same as those by BFO (resp. ORCL), as shown 
in Table 4. As such, we additionally tested two variants of CSO, 
i.e., CSO(vl) where HS is disabled, and CSO(v2) where SS is dis- 
abled. The experimental results are depicted in Fig. 5. It is obvious 
that, by assigning SS to wf 2, BFO/CSO significantly improved the 
query performance of Q6. Moreover, when M was 50MB/75MB, 



the introduction of CSO(vl) and CSO(v2) illustrated the cost dif- 
ference between FS and HS for both wf 1 and wf 2 . Thus, we can see 
that BFO/CSO made correct decisions on the choice between FS 
and HS for a window function, by using our proposed cost models. 





WPK 


WOK 




WPK 


WOK 


wf 1 


{date, time, ship} 


£ 


wf 4 


(item, bill) 


Wf 2 


{time, date} 




wf 5 


{date, time, 


{ship) 


Wf 3 


{item} 


e 


item, bill} 



Table 5: Window functions contained by Q7 



Scheme 


71/ (in MB) 


Plan 


BFO 


50/75 


WS Wf 1 — > Wf2 — » Wfs — > Wf4 — > Wf3 


150 


ws wf 1 — y wf2 — — y wfs — y wf4 — y wf3 


CSO 


50/75 


ws wfs — y wf4 — y wf3 ^> wf 1 — y wf2 


150 


ws — y wf 5 — y wf4 — y wf 3 — — > wf 1 — y wf 2 


ORCL 


50/75/150 


PSQL 


50/75/150 


FS FS FS FS FS 

ws — y wf 1 — y wf2 — y wf3 — y wf4 — y wfs 



Table 6: Execution plans for Q7 




Unit Reorder Memory (in MB) 

Figure 6: Evaluating different optimization schemes with Q7 

Q7 is actually the running example utilized by [5] in order to 
illustrate the optimization mechanism of ORCL. As shown in Ta- 
ble 6, for this query BFO, CSO and ORCL all managed to find the 
execution plans with the same minimum set of cover sets. The d- 
ifferences between their plans lie in the choices between HS and 
FS for a single window function, as well as the sequence of win- 
dow functions (or cover sets) in the chain. In contrast, Q7 actually 
highlighted the naiveness of PSQL, which failed to recognize the 
rather obvious optimization opportunity where adjusting the sort 
key of FS for wf 1 is able to make the output of wf 1 matches wf 2. 
As a result, the performance of PSQL was much worse than that 
of BFO/CSO/ORCL, as depicted in Fig. 6. On the other hand, the 
small performance differences among BFO, CSO and ORCL gave 
rise to some observations. First, for Q7, once again both BFO and 
CSO made correct decisions on the choice between FS and HS for 
a window function, by using our proposed cost models. Second, 
when SS was not used, the sequence of cover sets in the chain 
has minor impact on the plan performance, and this is consistent 
with the intuition behind our relation size assumption made in Sec- 
tion 4.2. 

Q8 was derived from Q7 by moving the item attribute from 
WOK4 of wf 4 into WPK4 and also moving the bill attribute from WPK5 
of wfs into WOK5, according to Table 6 and Table 8. The resultant 
execution plans are listed in Table 8. We can see that each of BFO, 
CSO and ORCL generated a minimum but distinct set of cover set- 
s. However, unlike BFO and CSO, ORCL cannot recognize the 
additional optimization opportunity that one of the three leading 
window functions of cover sets, i.e., wfs here for ORCL, is actual- 
ly SS-reorderable. Therefore, as shown in Fig. 7, the performance 
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WPK 


WOK 




WPK 


WOK 


Wf 1 


{date, time, ship} 




Wf 4 


{item} 


(bill) 


Wf 2 


{time, date} 


e 


Wf 5 


{date, time, 


(bill, ship) 


Wf 3 


{item} 


E 


item} 



Table 7: Window functions contained by Q8 



Scheme 


AI (in MB) 


Plan 


BFO 


50/75 


WS > Wf i — > Uf2 > Wfs > Wf4 — > Wf3 


150 


> Wf i — > Wf2 > Wfs > Wf4 — > Wf3 


CSO 


50/75 


ws — > wfg — > Wf 1 — > Wf2 — > Wf4 — > 1S±3 


150 


WS > Wfs > Wf i — > Wf2 > Wf4 — > Wf3 


ORCL 


50/75/150 


FS , ^ , * FS , * , * FS , 4: 
WS > Wf4 — > Wfs > Wfs — > Wf2 > Wf 1 


PSQL 


50/75/150 


FS FS FS FS FS 
WS > Wf 1 > Wf 2 > Wf3 > Wf4 > Wfs 



Table 8: Execution plans for Q8 




Unit Reorder Memory (in MB) 
Figure 7: Evaluating different optimization schemes with Q8 

of ORCL was a bit worse than BFO and CSO, although PSQL still 
performed the worst. On the other hand, the performance differ- 
ence between BFO and CSO was negligible, which showed that it 
is the numbers of cover sets and the SS, rather than their sequence 
in the chain, that affect the plan performance most significantly. 




a 2,000 



Unit Reorder Memory (in MB) 
Figure 8: Evaluating different optimization schemes with Q9 

The final tested query Q9 contained the most window functions 
and thus led to the most complicated execution plans, which are 
presented in Table 10. This time PSQL eventually succeeded in 
avoiding the FS for wf 3 and thus generated an execution plan that 
was comparable with that of ORCL. The reasons that ORCL lost 
out to BFO and CSO to a very large extent are two-fold. On the 
one hand, ORCL generated one more cover set than BFO and CSO; 
on the other hand, in essence it was not able to recognize those op- 
timization opportunities w.r.t SS. As for BFO and CSO, Table 10 
and Fig. 8 together show that their optimization effects were equal- 
ly well. The fact that CSO outperformed BFO with 50MB memory 





WPK 


WOK 




WPK 


WOK 


wf 1 


{item} 


(bill, date) 


wf 5 


{bill, date} 


(time) 


Wf 2 


{item, time} 


(date) 


wfe 


{bill} 


(time) 


wf 3 


{item} 


(time) 


wf 7 


{date, time} 


£ 


Wf 4 


(item, date) 


Wf 8 


(time) 



Table 9: Window functions contained by Q9 



Scheme 


M (in MB) 


Plan 


BFO 


50/75 


WS > Wf 1 > Wf2 — > Wl3 > VI4 

Hs , * SS FS 
> Wfs > WI 6 > Wf 7 -4 Wf g 


150 


FS, - SS SS. , 
WS > Wf 1 > Wf2 — > Wf3 > VI4 

FS, . SS PS v 
> Wf 5 > Wf 6 > Wf 7 — > Wf 8 


CSO 


50/75 


FS, . , . HS SS. 
WS > Wf 7 — > Wfs > Wf 6 > Wf 5 

FS , t , * SS, f SS v - 
> Wf2 — > Wf3 V Wf 1 > Wf4 


150 


FS , -f , t FS SS 
WS — > wf 7 — > wfs — > Wf6 — > wfs 

FS SS SS 
> Wf2 — > Wf3 > Wf 1 > Wf4 


ORCL 


50/75/150 


FS FS FS 
UJS > Wf 2 — > Wfs > Wf4 > Wf 7 

FS FS FS FS 
> Wf 1 > Wf 3 > Wf6 > Wfs 


PSQL 


50/75/150 


FS, . FS FS 
IUS > Wf 1 > Wf2 — > Wf 3 )■ WI4 

FS FS FS FS 
> Wfs > Wf 6 > Wf 7 > Wf 8 



Table 10: Execution plans for Q9 

was due to the accidental inconsistency between the actual plan ex- 
ecution costs and our cost model's estimations. 

In a summary, BFO and CSO always delivered the best execution 
plans for all the four tested queries, Q6, Q7, Q8 and Q9. ORCL 
performed much worse than BFO and CSO but significantly better 
than PSQL at the same time. 

6.3 Optimization Overheads 

In this experiment, we subsequently compared the optimization 
overheads of those four optimization schemes. We generated a list 
of window queries with different number of window functions to 
be evaluated on the web.sales table. In each window function wf 
of each query, we randomly determined the number of attributes as 
well as the attributes themselves for both WPK and WOK. The over- 
heads of different optimization schemes incurred by optimizing six 
queries, where the number of window functions ranged from 6 to 
10, are listed in Table 1 1 . 



# of wfs 
Scheme _____ 


6 


7 


8 


9 


10 


BFO 


1.56 


18.47 


9336 


489286 


9.8 x 10" 


CSO 


1.44 


3.56 


4.07 


7.49 


12.31 


ORCL 


0.99 


1.04 


1.27 


1.36 


1.49 


PSQL 


0.85 


0.91 


1.03 


1.11 


1.18 



Table 11: Optimization overheads (in millisecond) of optimiza- 
tion schemes for queries with varying numbers of window func- 
tions 

From Table 1 1 we can see that, the optimization overheads of 
both ORCL and PSQL increased slowly along with the number 
of window functions. As for CSO, its optimization overheads in- 
creased a bit faster but still remained very small. However, for 
BFO, as expected, its optimization overheads were acceptable for 
upto 7 window functions but became totally unacceptable when the 
number of window functions exceed 8. In particular, BFO took 
about 2.7 hours to derive the optimal plan for a query with 10 win- 
dow functions! 
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According to the experimental results of the last experiment, the 
effectiveness of CSO is very similar to that of BFO which is sup- 
posed to always generate optimal plans. Moreover, as illustrated 
above, CSO is much lighter-weight than BFO. As such, we can 
conclude that CSO achieves the best tradeoff between optimization 
effectiveness and optimization efficiency. 

7. RELATED WORK 

To the best of our knowledge, [5] is the only research report in the 
public domain that focuses on optimizing the evaluation of window 
functions. That work only used FS for tuple reordering. In contrast, 
we propose two new tuple reordering operations HS and SS, both of 
which are competitive alternatives to FS. The optimization scheme 
proposed in [5] also exploited the properties of WPKs and WOKs, and 
clusters window functions into Ordering Groups which are equiva- 
lent to our notion of cover sets, so as to minimize the total number 
of FS operations needed. However, our cover set-based optimiza- 
tion scheme naturally subsumes the optimization scheme of [5] and 
incorporates additional optimizations w.r.t HS and SS. In addition, 
the other two window function optimizations mentioned in [5], i.e. 
predicate pushdown for ranking functions and parallel execution of 
a single window function, are both complementary and can co-exist 
with our approaches. 

The window functions in a window query compute a set of addi- 
tional window-function columns for a single windowed table, ac- 
cording to different specifications of window partitioning and or- 
dering. Similarly, GROUPING SETS, ROLLUP and CUBE oper- 
ations, the three extensions to the GROUP BY clause, group the 
tuples of a table in multiple disparate ways, compute aggregations 
for different tuple groups, and finally concatenate all of the tuple 
groups together into a single result 6 . However, the optimization 
techniques available forthese GROUP BY extensions (e.g. [3], [10] 
and [11]) cannot be directly applied for window function evalua- 
tion. First, the window function evaluation retains the original ta- 
ble tuples, while in the evaluation of these GROUP BY extensions, 
each tuple group collapses into a single tuple. Second, the win- 
dow functions contain in a window query can be of different types, 
while in these GROUP BY extensions, a set of global aggregation 
functions are evaluated for different tuple groups. 

Research works like [15, 17, 18] have proposed optimization 
frameworks to infer the ordering and grouping properties held for 
intermediate results of query execution by using functional depen- 
dencies. They aimed at avoiding redundant sort and group op- 
erations in the query plan. In this paper, we also need to infer 
the properties of the intermediate results flowing between window 
functions, in order to determine the proper tuple reordering opera- 
tions. However, the only interesting property that we formulate is 
the relation segmentation as described in Section 3, of which both 
ordering and grouping are special cases. Moreover, in our frame- 
work, the operations that can alter the tuple ordering include the 
newly proposed HS and SS, whose behaviors are significantly dif- 
ferent from those operations considered in [15, 18], such as group 
by and join. As a result, their techniques are not directly applicable 
or extendible in our problem context. 

8. CONCLUSION 

In this paper, we have presented a comprehensive framework for 
optimizing the evaluation of window functions. We have proposed 
two new tuple reordering methods, namely Hashed Sort (HS) and 

6 Note that both ROLLUP and CUBE are special cases of 
GROUPING SETS. 



Segmented Sort (SS), that can efficiently reorder tuples for win- 
dow function evaluation. To handle complex queries involving 
multiple window functions, we also designed a light-weight cov- 
er set-based optimization scheme that generates a (near-)optimal 
window function chain for evaluating these window functions. We 
have integrated our techniques into PostgreSQL. Our extensive per- 
formance study showed that our techniques can bring substantial 
performance gains over existing window function implementations 
and optimizations. 

There are several directions for future work. First, the function- 
al dependencies existing between attributes of the windowed table 
have non-trivial impact on the optimizations for window function- 
s, and thus should be exploited further. Second, it is possible to 
develop some tailored optimizations for certain types of window 
functions, like the predicate pushdown optimization for window 
ranking functions as proposed in [5]. We plan to investigate oth- 
er kinds of window functions to identify additional optimizations 
for them. Finally, in this paper we assume a sequential evaluation 
model for window functions. However, an alternative evaluation 
model is a graph-based evaluation model, where a window func- 
tion may receive input from multiple sources and may deliver its 
output to multiple destinations. Studying the effectiveness of such 
a graph-based model is certainly in our agenda for future work. 
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